test_randoms
Basic Overview
Theory
Rarefaction curves
The x-axis of the plot shows the total number of sampled sequences, and the y-axis shows the total number of unique sequences.
The plot can be interpreted as follows:
- Samples with a higher diversity show more unique sequences at each point on the x-axis, and a higher sequencing depth extends the curve further to the right.
- Samples with a lower diversity show fewer unique sequences at each point on the x-axis.
This plot is useful for NGS because it can be used to assess the quality of the sequencing data and to determine whether the sequencing depth is sufficient for the desired application. For example, if you are trying to identify rare variants in a sample, you will need to have a high sequencing depth in order to be confident that the variants are real and not simply sequencing errors.
Here are some specific examples of how the plot could be used to interpret NGS data:
- If you are comparing two samples with different sequencing depths, you can use the plot to see which sample has more unique sequences at each point on the x-axis. This can help you to determine which sample is more likely to contain the variants of interest.
- If you are trying to determine whether the sequencing depth of a sample is sufficient for a particular application, you can use the plot to compare the sample to other samples that have been used successfully for the same application.
- If you are seeing a plateau in the number of unique sequences as the sequencing depth increases, this may indicate that you have reached a saturation point and that further sequencing will not yield much new information.
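The points of such a curve can be approximated by shuffling the reads of one sample and counting the unique sequences seen so far. The following is a minimal sketch, not the code used to generate this report; the function name `rarefaction_curve`, the `step` parameter, and the toy reads are assumptions made for illustration.

```python
import random

def rarefaction_curve(reads, step, seed=0):
    """Return (n_sampled, n_unique) points for one sample.

    `reads` holds one entry per sequenced read, so repeats are expected.
    """
    rng = random.Random(seed)
    shuffled = list(reads)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)       # random order = sampling without replacement
    seen = set()
    points = []
    for n, read in enumerate(shuffled, start=1):
        seen.add(read)
        # Record a point every `step` reads and at the very end.
        if n % step == 0 or n == len(shuffled):
            points.append((n, len(seen)))
    return points

# Hypothetical toy sample: 8 reads covering 3 distinct sequences.
toy_reads = ["AAA"] * 5 + ["CCC"] * 2 + ["GGG"]
curve = rarefaction_curve(toy_reads, step=2)
print(curve)
```

A curve whose last points barely rise corresponds to the plateau described above.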
Alignment Quality Plot
The barplot shows the total sequenced reads and aligned reads for each sample. The total sequenced reads is the number of reads that were generated by the NGS sequencer. The aligned reads is the number of reads that were successfully mapped to the reference genome.
The plot shows that the total sequenced reads is higher than the aligned reads for all samples. This is because some of the reads may be of low quality or may not map to the reference genome.
Furthermore, it shows that the difference between the total sequenced reads and aligned reads is larger for some samples than others. This may be due to a number of factors, such as the quality of the DNA sample, the sequencing platform used, and the alignment parameters used.
Use of the barplot for NGS:
The barplot can be used to assess the quality of the NGS data and to determine whether the alignment rate is sufficient for the desired application.
Specific examples:
- If you are comparing two samples, you can use the plot to see which sample has a higher alignment rate. This can help you to determine which sample is more likely to contain the variants of interest.
- If you are trying to determine whether the alignment rate for a sample is sufficient for a particular application, you can compare the sample to other samples that have been used successfully for the same application.
- If you see a low alignment rate for a sample, this may indicate a problem with the DNA sample, the sequencing platform used, or the MiXCR parameters used.
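The alignment rate mentioned above is the fraction of sequenced reads that were aligned. A small sketch of that calculation follows; the sample names, the read counts, and the 0.7 cutoff are hypothetical and only serve as an illustration.

```python
def alignment_rate(total_reads, aligned_reads):
    """Fraction of sequenced reads that were mapped to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= aligned_reads <= total_reads:
        raise ValueError("aligned_reads must lie between 0 and total_reads")
    return aligned_reads / total_reads

# Hypothetical counts: (total sequenced reads, aligned reads) per sample.
counts = {"sample_a": (100_000, 92_000), "sample_b": (80_000, 40_000)}
rates = {name: alignment_rate(total, aligned)
         for name, (total, aligned) in counts.items()}

# Arbitrary illustrative cutoff; a sensible value depends on the application.
low_rate_samples = [name for name, rate in rates.items() if rate < 0.7]
print(rates, low_rate_samples)
```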
Heatmap based on Morisita-Horn Index
The heatmap shows a matrix whose values are calculated with the Morisita-Horn index. This index captures the degree of identity between two samples that each consist of multiple clones of different sizes. It is especially useful for comparing samples that contain different numbers of clones and have different numbers of aligned reads. In the context of this heatmap, identity means that two sequences have the same length and exactly the same amino acid sequence.
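For two samples with clone counts $x_i$ and $y_i$ and totals $X$ and $Y$, the index is commonly written as $C_H = \frac{2\sum_i x_i y_i}{\left(\sum_i x_i^2 / X^2 + \sum_i y_i^2 / Y^2\right) X Y}$. The sketch below applies this textbook form to dictionaries of clone counts. It is an illustration under that assumption and may differ in detail from the code behind the heatmap; the function name and the counts are hypothetical.

```python
def morisita_horn(counts_a, counts_b):
    """Morisita-Horn overlap of two samples.

    Each argument maps an amino acid sequence to its clone count.
    Returns 0.0 for no shared sequences and 1.0 for identical proportions.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both samples need at least one counted clone")
    shared = set(counts_a) & set(counts_b)
    cross = sum(counts_a[seq] * counts_b[seq] for seq in shared)
    simpson_a = sum(c * c for c in counts_a.values()) / total_a ** 2
    simpson_b = sum(c * c for c in counts_b.values()) / total_b ** 2
    return 2 * cross / ((simpson_a + simpson_b) * total_a * total_b)

# Hypothetical counts: same proportions at different sequencing depths.
deep = {"CNAVHSRWQAMTRW": 10, "CIQSGTDRR": 30}
shallow = {"CNAVHSRWQAMTRW": 1, "CIQSGTDRR": 3}
unrelated = {"CAAGSDYNLDGSW": 5}
print(morisita_horn(deep, shallow), morisita_horn(deep, unrelated))
```

Because the counts are normalised by the sample totals, the two samples with equal proportions score as fully identical despite their different read numbers.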
The heatmap has several possible applications during the quality control of the sequencing:
1. Identification of cross-contamination between samples. This is indicated by a high degree of identity between samples that were panned against different antigens.
2. Validation of the panning process. This is indicated by a high degree of identity between samples that were panned against the same antigen in different rounds.
Weaknesses:
- The Morisita-Horn index becomes less accurate if the number of elements (sequences) in one of the samples is very low.
Alternatives:
The user can analyse the identity between the samples based on:
- Jaccard Index
- Sorensen-Dice Index
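Both alternatives compare only the presence or absence of sequences and ignore clone counts. A minimal sketch using the standard set-based definitions follows; the function names and the two example sets are chosen for illustration, and returning 0.0 for two empty samples is an assumed convention.

```python
def jaccard(sample_a, sample_b):
    """Shared sequences divided by all distinct sequences of both samples."""
    a, b = set(sample_a), set(sample_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def sorensen_dice(sample_a, sample_b):
    """Twice the shared sequences divided by the summed sample sizes."""
    a, b = set(sample_a), set(sample_b)
    size = len(a) + len(b)
    return 2 * len(a & b) / size if size else 0.0

# Illustrative sets: two of the three sequences in each sample are shared.
first = {"CIQSGTDRR", "CIQSGTSRR", "CNAVHSRWQAMTRW"}
second = {"CIQSGTDRR", "CNAVHSRWQAMTRW", "CAAGSDYNLDGSW"}
print(jaccard(first, second), sorensen_dice(first, second))
```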
The content generated in this section was written by Nils Hofmann. The author does not guarantee the correctness of the content.
Identity between samples
Identity based on Morisita-Horn Index
Identity based on Jaccard Index
Identity based on Sorensen Index
Sequencing Quality of samples
General Sequencing Quality
Sample specific quality analysis
Quality analysis for random1_R1_001
Quality analysis for random2_R1_001
Quality analysis for random3_1_001
Quality analysis for random4_R1_001
Sequence Clusters
Theory
Clustering based on Levenshtein distance
The Levenshtein distance is a metric that measures the difference between two strings. It is calculated by counting the minimum number of edits (insertions, deletions, and substitutions) required to transform one string into the other.
For example, the Levenshtein distance between the strings “CAT” and “DOG” is 3, because the two strings have the same length and differ at every position, so each of the three letters has to be substituted to transform “CAT” into “DOG”.
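The edit count can be computed with the standard dynamic programming approach. The sketch below is one common way to write it and is not necessarily the implementation used for this report; the CDR3 strings are taken from the tables in this report.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row short
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Two CDR3 sequences that differ by a single substitution (D -> S).
print(levenshtein("CIQSGTDRR", "CIQSGTSRR"))
```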
The Levenshtein distance can be used to analyze peptide sequences in a variety of ways. For example, it can be used to:
- Identify similar CDR3 regions. This can be useful for identifying antibodies with similar antigen-binding specificities.
- Detect mutations in CDR3 regions. This can be useful for identifying mutations that may affect the affinity or specificity of an antibody.
- Cluster CDR3 regions into groups. This can be useful for identifying groups of CDR3 regions with similar antigen-binding specificities.
The connected components in the plot (shown as a network of lines) are groups of CDR3 sequences that are similar to each other, based on a threshold for the Levenshtein distance (default = 2). The size of each circle in the plot represents the number of clones for the corresponding CDR3 sequence relative to the clone counts of the other sequences.
Overall, the plot can be used to get a general overview of the diversity and similarity of the CDR3 sequences in the library.
Furthermore, a table lists the sequences sorted by clone count, so the top 10 sequences are those with the highest clone fraction in that sample. This table can be used to identify specificity for a certain antigen based on the density of these sequences in certain clusters.
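The grouping into connected components can be sketched as follows: two sequences are linked when their Levenshtein distance does not exceed the threshold, and every set of directly or indirectly linked sequences forms one cluster. This is a simplified illustration, not the report's own code; treating the threshold as inclusive and the function name `cluster_sequences` are assumptions.

```python
from itertools import combinations

def levenshtein(a, b):
    """Standard dynamic programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]

def cluster_sequences(sequences, threshold=2):
    """Group sequences into connected components (union-find)."""
    parent = list(range(len(sequences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(sequences)), 2):
        if levenshtein(sequences[i], sequences[j]) <= threshold:
            parent[find(i)] = find(j)       # link the two components

    clusters = {}
    for i, seq in enumerate(sequences):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(clusters.values(), key=len, reverse=True)

# Sequences taken from the random4_R1_001 table in this report.
cdr3 = ["CNAVHSRWQAMTRW", "CIQSGTSRR", "CIQSGTDRR",
        "CNAVHSRWQAMTHW", "CNTEEESTGTYYEW", "CIQSGTDRK"]
for members in cluster_sequences(cdr3):
    print(members)
```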
Disadvantages of the Levenshtein distance
- Only captures the global similarity of sequences and does not focus on specific regions
- Does not take any chemical properties or structural properties into account
Sequence Cluster Dendrogram
The dendrogram is a visualization of the similarity between a set of CDR3 heavy chain sequences. The sequences are clustered together based on their Levenshtein distance, which is described in the previous section.
The dendrogram can be used to understand the diversity of the CDR3 sequences in a sample. For example, if the dendrogram shows that most of the sequences are clustered together in a few large clusters, this suggests that there is a relatively low level of diversity in the sample. On the other hand, if the dendrogram shows that the sequences are spread out in many small clusters, this suggests that there is a high level of diversity in the sample.
The dendrogram can also be used to identify groups of similar sequences. For example, if we want to identify a group of sequences that are likely to have similar antigen-binding specificities, we can look for a cluster of sequences that are close to each other on the dendrogram.
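A dendrogram is built by repeatedly merging the two closest groups and recording the distance at which each merge happens. The report does not state which linkage method is used, so the sketch below assumes single linkage (distance between the closest members of two groups) purely for illustration; the function name `agglomerate` is hypothetical.

```python
def levenshtein(a, b):
    """Standard dynamic programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]

def agglomerate(sequences):
    """Return the merges (group_a, group_b, distance) in merge order.

    Assumption: single linkage, i.e. the distance between two groups
    is the smallest distance between any two of their members.
    """
    groups = [(seq,) for seq in sequences]
    merges = []
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                dist = min(levenshtein(a, b)
                           for a in groups[i] for b in groups[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        merges.append((groups[i], groups[j], dist))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return merges

# Sequences taken from the tables in this report.
for group_a, group_b, height in agglomerate(
        ["CIQSGTDRR", "CIQSGTSRR", "CNAVHSRWQAMTRW", "CNAVHSRWQAMTHW"]):
    print(height, group_a, group_b)
```

Merges at a small distance correspond to the tight clusters described above, while the final merge at a large distance joins unrelated groups.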
Clustering based on t-SNE
To address the limitations of the Levenshtein distance for clustering HCDR3 sequences, a plot was created that shows the sequences in relation to each other across their whole sequence, which enables a more extensive analysis of the sequences and their binding. To represent the sequences as vectors, the SGT (Sequence Graph Transform) embedding was applied.
This embedding does not capture any chemical properties of the amino acid chain. Instead, it tries to recognize the characteristic relative positions of letters within a sequence, which makes it possible to identify patterns between sequences of different lengths. This accounts for the variability of the CDR3 sequences and makes it possible to capture their similarity. The SGT embedder was initialized using the package sgt, with the parameter length sensitive set to True and the parameter kappa set to 1. The embedding produces an output of 400 dimensions per sequence, which was reduced to 80 dimensions using principal component analysis (PCA).
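The 400 dimensions correspond to the 20 x 20 ordered pairs of amino acids. The sketch below shows one way to compute such pair features by hand. It is a simplified reading of the published SGT definition and does not use or reproduce the sgt package, whose results may differ in detail; `sgt_embedding` and `AMINO_ACIDS` are hypothetical names, and the PCA step is not shown.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 residues -> 400 ordered pairs

def sgt_embedding(sequence, kappa=1.0, length_sensitive=True):
    """Return {(u, v): value} with one feature per ordered residue pair.

    For every pair, all occurrences of u followed later by v are
    weighted by exp(-kappa * gap), so nearby pairs contribute most.
    """
    positions = {}
    for index, residue in enumerate(sequence):
        positions.setdefault(residue, []).append(index)

    features = {}
    for u, v in product(AMINO_ACIDS, repeat=2):
        gaps = [m - l for l in positions.get(u, [])
                for m in positions.get(v, []) if m > l]
        if not gaps:
            features[(u, v)] = 0.0      # pair never occurs in this order
            continue
        weight = sum(math.exp(-kappa * gap) for gap in gaps)
        # Assumed normalisation: number of pairs, scaled by the sequence
        # length when the length-sensitive variant is requested.
        norm = len(gaps) / len(sequence) if length_sensitive else len(gaps)
        features[(u, v)] = (weight / norm) ** (1.0 / kappa)
    return features

vector = sgt_embedding("CIQSGTDRR")
print(len(vector), vector[("C", "I")])
```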
Subsequently, t-SNE was applied to arrange the 80 dimensions in two-dimensional space, because a further reduction to only two dimensions with PCA would describe the data insufficiently on a global level. t-SNE makes it possible to show a representative arrangement of the points without losing too much information.
The plot can be used in a variety of ways, such as:
1. Identify similar CDR3 regions.
2. Learn about the differences between samples.
3. Identify specific sequence differences for samples which were panned against different antigens.
The content generated in this section was written by Nils Hofmann. The author does not guarantee the correctness of the content.
Levenshtein distance based
Sequence Similarity based on Levenshtein Distance of random1_R1_001
| Index | Cluster No. | Sequences |
|---|---|---|
| 0 | 0 | CNAVHSRWQAMTRW |
| 1 | 1 | CIQSGTDRR |
| 2 | 30 | CNVDLTVVDGRHLPRGDYW |
| 3 | 1 | CIQSGTSRR |
| 4 | 9 | CHADLRVRDGVRGDYW |
| 5 | 0 | CNAVHSRWQAMTHW |
| 6 | 4 | CAADLFGTRQADLLIYNFR |
| 7 | 5 | CNAVGADRISGVIYW |
| 8 | 6 | CATRFTTPWDARAPAYYNYW |
| 9 | 7 | CAAQVYTSGIYYYSGSYDYW |
Sequence Similarity based on Levenshtein Distance of random2_R1_001
| Index | Cluster No. | Sequences |
|---|---|---|
| 0 | 4 | CAADSPSLPDRTAAHSYDYAYW |
| 1 | 1 | CAAERVPSAIYTHYEFCVGADEYDYW |
| 2 | 2 | CAADLFGTRQADLLIYNFR |
| 3 | 3 | CAAGSDYNLDGSW |
| 4 | 4 | CAAESPDLVDRTATHSYDYSYW |
| 5 | 5 | CHADIRVRAGVRGDYW |
| 6 | 6 | CGADKFPYSAETMCSIPGGPDLDAW |
| 7 | 4 | CAAESPFLPDRTARHSYDYTYW |
| 8 | 7 | CNAVHSRWQAMTRW |
| 9 | 8 | CAADKFPYAAETMCVIPGGPDADTW |
Sequence Similarity based on Levenshtein Distance of random3_1_001
Sequence Similarity based on Levenshtein Distance of random4_R1_001
| Index | Cluster No. | Sequences |
|---|---|---|
| 0 | 7 | CAADSPSLPDRTAAHSYDYAYW |
| 1 | 1 | CNAVHSRWQAMTRW |
| 2 | 2 | CIQSGTSRR |
| 3 | 2 | CIQSGTDRR |
| 4 | 1 | CNAVHSRWQAMTHW |
| 5 | 3 | CNTEEESTGTYYEW |
| 6 | 7 | CAAESPDLVDRTATHSYDYSYW |
| 7 | 5 | CAAERVPSAIYTHYEFCVGADEYDYW |
| 8 | 6 | CAAGSDYNLDGSW |
| 9 | 2 | CIQSGTDRK |